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Abstract —Gaussian process regression is a popular method for 
non-parametric probabilistic modeling of functions. The Gaus¬ 
sian process prior is characterized by so-called hyperparameters, 
which often have a large influence on the posterior model and can 
be difficult to tune. This work provides a method for numerical 
marginalization of the hyperparameters, relying on the rigorous 
framework of sequential Monte Carlo. Our method is well suited 
for online problems, and we demonstrate its ability to handle 
real-world problems with several dimensions and compare it 
to other marginalization methods. We also conclude that our 
proposed method is a competitive alternative to the commonly 
used point estimates maximizing the likelihood, both in terms 
of computational load and its ability to handle multimodal 
posteriors. 


I. Introduction 

The Gaussian process (GP) is a non-parametric probabilistic 
model that can be used to model an unknown nonlinear 
function /(•) from observed input data x and (noisy) output 
data y = f{x). No explicit form of /(•) is assumed, but some 
assumptions on /(•) are encoded through the GP prior and 
a mean function mo{x), a covariance function no{x,x'), and 
their so-called hyperparameters 6> G 0. In mathematical terms, 
/ is a priori modeled to be distributed as 

f{x)gv(m0{x),K0{x,x')^, (1) 

i.e., an infinite-dimensional Gaussian distribution. See [1] for 
a more general introduction to GPs. 

The posterior distribution over /(•) given data (^, x) is also 
a GP. This is due to the conjugacy property of the Gaussian 
distribution. The posterior is often greatly infiuenced by the 
choice of hyperparameters 0, which typically are unknown. 
We therefore propose a method to marginalize the hyperpa¬ 
rameters in GPs. Marginalization can be seen as averaging 
over the range of hyperparameters supported by the data and 
by the prior; 0 can be integrated out by treating it as a random 
variable with prior p{0) and likelihood p{y\x^0), giving rise 
to the posterior p{0\y^x) oc p{y\x^ 0)p{0). For example, the 
predictive distribution is computed by 

P{y*\x*,y,x) = J p{y*\x*,y,x,d)p{d\y,x)dd, (2) 

which unfortunately is analytically intractable. However, us¬ 
ing a Monte Carlo method to obtain N (weighted) samples 
of the distribution p{0\y,x), the predictive 
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(a) Gaussian process regression for the data set defined by the red dots, using 
two different point estimates for the hyperparameters, each corresponding to 
a local minimum in (b, left). 
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(b) Left: the (multimodial) hyperparameter posterior conditional on the 9 data 
points. Right: the posterior using the proposed method (which marginalizes 
the hyperparameters, and thus handles the multimodality). 

Fig. 1. A small example illustrating the infiuence of the hyperparameters in 
the GP prior to the posterior estimate. 




distribution (2) can be approximated by 

N 

Ky*\x*,y,x) = (3) 

i=l 

where the weights are normalized, i.e., = 1- 

A common alternative to marginalization is to choose a 
point estimate of 0 using an optimization procedure maximiz¬ 
ing the likelihood p{y\x, 0) (sometimes referred to as empirical 
Bayes). This may be difficult if the likelihood is multimodal. 
See the small toy example in Figure 1 illustrating the robust¬ 
ness of marginalization compared to point estimates. There 
are also situations where point estimates are not sufficient, 
and marginalization is necessary, such as the change point 
detection problem in Section III-C. 

Our contribution is a method for sampling from the hyper¬ 
parameter posterior distribution p{0\y^ x), based on sequential 
Monte Carlo (SMC) samplers [2]. SMC samplers and their 
convergence properties are well studied [3]. 

Several methods have previously been proposed in the liter¬ 
ature for marginalization of the GP hyperparameters: Bayesian 
Monte Carlo (BMC) [4], slice sampling [5], Hamiltonian 






Monte Carlo [6,7], and adaptive importance sampling (AIS) 
[8]. Particle learning which is closely related to SMC has 
been proposed by [9] for this purpose. The work by [9], 
however, is not targeting the hyperparameters directly, and 
makes (possibly restrictive) assumptions on conjugate priors 
and model structure. 

In this paper, we compare our proposed method to some of 
these methods, and apply it to two real-data problems: the first 
demonstrates that marginalization does not have to be more 
computationally demanding than finding point estimates. The 
second example, which deals with a fault detection problem 
from industry, is possible only with an efficient method for 
marginalization. Our proposed method (and all examples) are 
available as Matlab code via the first authors homepage. 

From the experiments, we conclude that the advantages of 
the proposed method are (i) robustness towards multimodal 
hyperparameter posteriors, (ii) simplified tuning (compared to 
some other alternatives), (iii) competitive computational load, 
and (iv) online updating of hyperparameters as the data record 
grows. 


II. Sampling hyperparameters using SMC 


For the numerical marginalization (3), we require N sam¬ 
ples, known as particles, from the posterior. In this section, 
we discuss how to use a SMC sampler [2] to generate such 
a particle system where is the weight of 

particle The underlying idea is to construct a sequence 
of probability distributions ({tto, ... ,7rp}), starting from the 
prior, and ending up in the posterior. The particles are then 
‘guided’ through the sequence. 

To construct a sequence {tto, ..., ttp}, we use the fact that 
p{0\y^x) depends on the data {y^x), by partitioning the data 
points into P disjoint batches and adding them 

sequentially as 7rn{0) oc 6>)p(6>). 

To guide the particles through the smooth sequence 
{tto, ..., TTp}, we will iteratively apply the three steps weight¬ 
ing, resampling and propagation, akin to a particle filter. 

In the weighting step, the ‘usefulness’ of each particle is 
evaluated. To ensure convergence properties, the particles can 
be evaluated as [2, Section 3.3.2] 
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To avoid numerical problems, the particles have to be 
resampled. The idea is to duplicate particle with large weights, 
and discard particles with small weights. 

To propagate the particles from tt^-i to tt^, a 

Metropolis-Hastings (MH) kernel /C : 0 i-A 0 with invariant 
distribution tt^ can be used. The procedure of propagating 
Oji-i (a sample of 7r^_i) to On (a sample of tt^) by /C is as fol¬ 
lows: (i) Sample a new particle 0' from a proposal g(-|6>n-i), 
e.g., a random walk with variance h. (ii) Compensate for 
the discrepancy between tt^ and q by setting On = 0' with 
probability 


a{9n,0') = min |l, 


TTnjO') q{9n\9') 1 


(5) 
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(a) A transition from the prior p(6) to the posterior p(6\y,x) for the 
data in Figure lb, obtained by adding 3 data points in each step to the 
likelihood. The particles are obtained from the SMC sampler. 



(b) GP regression with marginalized hyperparameters from the corre¬ 
sponding posterior, obtained as a by-product from the particles depicted 
in (a). From left to right, 0 data points (i.e., the prior), 3 data points, 6 
data points, and 9 data points. As we formulated the problem, only the 
rightmost hgure is of interest. This illustrates however how this method 
can be used in online problem in a natural way. 

Fig. 2. Illustration of the SMC sampler, as it evolves from the prior (no 
data) to the posterior (all data). 

Algorithm 1 Hyperparameter posterior sampler 

Input: Data {y,x), GP prior, and prior p{0). 

Output: N samples from p{0\y,x) oc p{y\x,0)p{0). 

All statements with superscript (z) are for z = 1,..., A. 

1: Define 7rn{0) = p{yBi,r,\xBi,n,0)p{0) by partitioning the data 
into P batches {Bn}n=i- 
2: Sample Oq^ from p{0) (= 7ro(^)). 

3: for n = 1 to P do 

4: Update weights according to (4). 

5: Resample {0n\wn^}fLi if needed. 

6: for /c = 1 to iC do 

7: Propose 0'^^^ from q{0'\0t~^'^^^)> 

8: Set = 0'^^^ with prob. (5). 

9: end for 

10: end for 


and otherwise On = On-i- To improve the mixing, this 
procedure can be repeated K times. For this, we use the 
notation On-i = 0^ ^ 0}^ ^ O!^ = On- 

We now have an SMC sampler to obtain samples from 
the hyperparameter posterior, summarized in Algorithm 1 and 
illustrated by Figure 2. From the figure, the suitability to online 
applications is clear: If another data point is added to the 
data, the sequence can be extended to tt^ including the new 
data point, and only the transition from tts to tt^ has to be 
performed. 

We make use of the adaptive SMC sampler by [10] in the 
numerical examples to adapt the proposal q automatically. 

The computational cost of Algorithm 1 is in practice gov¬ 
erned by the 2NPK evaluations of the likelihood p{y\x,0). 
Hence, it is important to choose the number of samples N, 
SMC steps P, and MH-moves per SMC-step K sensibly. An 
idea of sensible numbers will be given along with the examples 
in the next section. 
















III. Examples and results 


We consider three examples for demonstrating our proposed 
approach. First, we consider a small simulated example, also 
comparing to alternative sampling methods, and thereafter two 
applications with real-world data. The first real-data example is 
a benchmark problem to compare the marginalization approach 
in Algorithm 1 to the point estimates obtained using optimiza¬ 
tion. In the third example, we illustrate how we can make 
use of our solution within a GP-based online change point 
detection algorithm. To this end, we require marginalization of 
the hyperparameters, so an efficient hyperparameter posterior 
sampler is indeed a key enabler for this. The online nature 
of the problem also fits well to the possibility to update the 
samples in Algorithm 1 online, as discussed in Section II. 

A. Simulated example 

We consider a small problem of 5 data points, and a 
covariance and mean function with 7 hyperparameters in total. 
We begin by considering the problem of marginalizing out 
7 hyperparameters in a GP prior given 5 data points. Here, 
we are interested in comparing the performance of our SMC 
sampler (Algorithm 1) with some popular alternative methods; 
BMC [4], AIS [8], and (deterministic) griding. 



SMC, N = 6, P = 3, K = 2, (0.13 s) 


AIS, N = 5, K = 5, (0.071 s) 


BMC, N = 128, (0.43 s) 
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SMC, N = 47, P = 7, K = 2, (1.6 s) 


AIS, N = 25, K = 25, (1.4 s) 


Grid, N = 128, (0.39 s) 







SMC, N = 101, P = 11, K = 3, (7.3 s) 

AIS, N = 47, K = 47, (4.6 s) 


Grid, N = 2187, (4.8 s) 
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Fig. 3. Comparison between 15 runs of SMC (Algorithm 1), BMC, AIS, 
and griding, as well optimized point estimates. The predictions (mean, solid, 
and 3 standard deviations, dashed) are shown, together with the red data 
points. The number of particles/samples/grid points is denoted by N, while 
K and P are algorithm specific tuning parameters. The mean computation 
time is also shown. All axis are equally scaled. 

The quite ‘messy’ look in most of the plots indicates that the same 
method (with fixed settings) behaves differently on each run, which of course 
is an unwanted effect. However, the SMC sampler is not suffering from this 
problem for N, P, K large enough. This effect should also be expected for 
AIS and BMC, but apparently they need more samples/iterations (and thus 
computing time) than presented here before that effect can be seen. 


The results for 15 runs are presented in Figure 3; it is indeed 
good if the variance between consecutive runs of the same 
algorithm gives similar results. The variations between the 
runs decrease faster for Algorithm 1 than for the comparable 
methods. When the GP prior has few hyperparameters, we 
conclude that the AIS and griding might be competitive 
methods. We have not managed to obtain competitive results 
with BMC for any problem size, but it should be noted 
that the computational load of BMC can be substantially 
decreased if the hyperparameter prior is independent between 
the dimensions. 

The results for the conceptually different point estimates 
are also presented in Figure 3. The initialization point to the 
optimization algorithm is drawn from the prior: although it is 
a deterministic method, it is obviously very sensitive to the 
initialization. 

B. Learning a robot arm model 

We consider the problem of learning the inverse dynamics 
of a seven degrees-of-freedom SARCOS antromorphic robot 
arm [1,11]. We use the same setup as [1, Section 2.5], i.e., a 
non-trivial setting involving 23 hyperparameters. 

To handle the size of the data set (44 484 training and 4 449 
test data points), we make use of a subset of: (i) datapoints and 
(ii) regressors as discussed by [1, Section 8.3.1]. To use our 
method, we sample the hyperparameters from the posterior 
with a subset of m data points. For comparability, we have 
also reproduced the results using point estimates from [1]. 
The results are reported in Table I. For Algorithm 1, = 15, 

P = 20 and K = b was used. The priors to the logarithms 
of the length-scale and the signal variance are A/'(3,3), and 
A/'(l, 1) for the noise variance. 


TABLE I 

Results for the SARCOS example in Section III-B. 


Method 

m 

SMSE (xl0“^) 

MSLL 

Time (s) 

Subset oe datapoints 



Point est. 

256 

8.36 ± 0.80 

-1.38 ± 0.04 

6.8 

SMC 

256 

8.10 ± 1.32 

-1.38 ± 0.56 

7.1 

Point est. 

512 

6.36 ± 1.13 

-1.51 ±0.05 

26.4 

SMC 

512 

6.13 ± 0.91 

-1.49 ± 0.04 

22.3 

Point est. 

1024 

4.31 ± 0.16 

-1.66 ± 0.02 

101 

SMC 

1024 

4.54 ± 0.33 

-1.61 ± 0.03 

92.5 

Point est. 

2048 

2.99 ±0.08 

-1.78 ±0.03 

423 

SMC 

2048 

3.33 ± 0.28 

-1.69 ± 0.06 

405 

Subset oe regressors 



Point est. 

256 

3.67 ±0.17 

-1.63 ±0.02 

6.8 

SMC 

256 

3.55 ± 0.28 

-1.65 ± 0.05 

7.1 

Point est. 

512 

2.77 ± 0.44 

-1.79 ± 0.07 

26.4 

SMC 

512 

2.89 ± 0.20 

-1.77 ± 0.03 

22.3 

Point est. 

1024 

2.03 ±0.11 

-1.95 ±0.03 

101 

SMC 

1024 

2.00* 

-1.95* 

92.5 


Table I presents the results in the same way as [1, Table 
8.1]. SMSE is the standardized mean square error (i.e., mean 
square error normalized by the variance of the target), and 
MSLL is the mean standardized log loss; 0 if predicting using 
a Gaussian density with mean and variance of the training 
data, and negative if ‘better’. The time is referring to the 
time required to sample and optimize the hyperparameters, 
respectively (not including the test evaluation). Numerical 
problems were experienced for large m, therefore * indicates 
runs where no interval can be reported. 

Table I indicates no significant difference between the 
performance of our method and point estimates. It is however 
worth also to note the computational load: As Algorithm 1 
apparently makes an equally good job in finding relevant 
hyperparameters as the optimization, it is a confirmation that 
our proposed method is indeed a competitive alternative to 
point estimates even for large problems. 



























IV. Conclusion 





(a) Measurements of dissolved oxygen (in mg/1) in a bioreactor with a 
sampling period of 15 minutes. The indicated change points are marked 
in red. Especially as the algorithm is fully Bayesian, the outcome is one 
probability distribution per data sample. This is comprehensively illustrated 
as the occurrence of change points in ‘backwards simulations’ through these 
distributions. A more intensive red color is a more likely change point. 



Time (hours) 


(b) Thresholding of Figure (a), with GP regression in each obtained seg¬ 
ment. The different characteristics in different segments are possible due to 
marginalization of the hyperparameters. 

Fig. 4. Results for the GP-based change point detection. 

C Fault detection of oxygen sensors 

We now consider data from the wastewater treatment plant 
Kappalaverket, Sweden. An oxygen sensor measures the dis¬ 
solved oxygen (in mg/1) in a bioreactor, but the sensor gets 
clogged because of suspended cleaning. The identification of 
such events is relevant to the control of wastewater treatment 
plants [12]. We apply the GP-based online change point 
detection algorithm by [7], where the hyperparameters are 
marginalized using our proposed method. 

The GP-based change point detection presented by [7] can 
be summarized as follows: If data yi^T undergo a change 
at time r, it is of interest to (online) detect r, i.e., esti¬ 
mate p{r\yi:t). The algorithmic idea is a recursive message 
passing scheme, updating the probability p(rt,//i:t), where 
Tt G is the last change point at time t. 

To make predictions using a GP model, the hyperparameters 
either have to be fixed across all data segments, or marginal¬ 
ized. As it is not relevant to use fixed hyperparameters, 
an efficient sampling algorithm is a key enabler in solving 
this problem. The consecutive predictions p{yt\rt-i,yrt:t-i) 
and p{yt-\-i\rt-i,yrt:t) are both needed for the algorithm, 
hence our approach fit this problem well, as discussed in 
Section II. We used N = 2b particles. On average, sampling 
the hyperparameters, i.e., one run of Algorithm 1, took 0.55 
seconds on a standard desktop computer. 

The results are presented in Figure 4a. The expected points, 
suspension and resuming of the cleaning, are indeed indicated. 
An interpretation of the result is obtained by converting the 
results to point estimates by thresholding, and plotting at the 
GP regression for each individual segment, see Figure 4b. 

Note the data-driven nature of the algorithm, as no explicit 
model of the sensor was used at all. The tuning parameters 
are the covariance and mean functions, the prior of the change 
points and the hyperparameter priors. 


We have proposed and demonstrated an SMC-based method 
to marginalize hyperparameters in GP models. The observed 
benefits are robustness towards multimodal posteriors (Fig¬ 
ure 1) and a competitive computational load (Section III-B), 
also compared to the commonly used point estimates of 
the hyperparameters. We have been able to cope with a 
hyperparameter space of dimension 23 (Section III-B), and 
also concluded a sound convergence behavior (Section III). 
Finally, the online update of the hyperparameters has been 
shown useful within the industry-relevant data-driven fault 
detection application (Section III-C). As a future direction, 
it would be interesting to apply our method to the challenging 
GP optimization problem of system identification [13]. 
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